Abstract

This paper introduces a new evaluation function for solving the
multiple-instance problem. Our approach makes use of the main idea of
diverse density (Maron, 1998; Maron & Lozano-Perez, 1998) but finds the
best concept using the chi-squared statistic. This approach is simpler
than diverse density and allows us to search more extensively by using
properties of the contingency table to prune in a guaranteed manner. We
demonstrate that this approach solves the multiple-instance problem as
well as or better than diverse density and that the pruning mechanism
allows chi-squared to identify the best concepts more quickly.

1.  Introduction 

Multiple instance learning (MIL) is a useful and well-known technique
for learning with ambiguous or partially labeled data. For example,
Dietterich et al. (1997) proposed the task of identifying what part of a
molecule binds with a musk receptor in the nose. A molecule can take on a
number of different shapes, or conformations, but it is likely that only
one of those shapes will bind with a musk receptor. It is difficult to
determine which shape actually matched the musk receptor, although it is
possible to observe the presence of a reaction for a given molecule (and
thus a set of shapes). The task is to identify which shape and part of
the molecules cause them to smell musky. The data available to a MIL
agent are in the form of labeled sets of instances, e.g., a set of shapes
with a single label. Supervised learning uses individually labeled
instances, which are difficult to obtain for many tasks. Labeling each
instance in the set with the label for the set produces too much noise
for supervised learning, whereas MIL techniques are designed to learn
from exactly this type of data.

More specifically, an MI learner uses labeled bags, where a bag is a
collection of instances with one label for the entire collection. A
positive bag contains at least one instance of the target concept while a
negative bag contains none. An instance is a point in feature space and
the target concept is another point in feature space. The goal is to find
a concept that explains the labels for the bags and can predict labels
for unseen data. It is not known in advance which instance is the one
causing the bag to be labeled as positive. If this were known, a
supervised learning approach could be used instead.

Table 1. Contingency table used to identify the best concept from a set
of positive and negative bags using the chi-squared statistic.

                      Actual positive   Actual negative
Predicted positive           a                 b
Predicted negative           c                 d

Successful applications of MIL include such tasks as recognizing MUSK
molecules (Dietterich et al., 1997), predicting stock trends (Maron,
1998), image retrieval tasks (Goldman et al., 2002; Maron & Ratan, 1998),
and identifying useful subgoals (McGovern & Barto, 2001a; McGovern,
2002). One of the main MI techniques is that of diverse density (Maron,
1998; Maron & Lozano-Perez, 1998). This technique is widely used because
of the intuitive appeal of its central idea: that the best concept is the
concept closest to the intersection of the positive bags and farthest
from the union of the negative bags. Because the nearby positive
instances are from different bags, Maron denotes this idea as diverse
density. Regular density algorithms would not take into account the
different bags and would focus only on instance-level density. In spite
of the intuitive appeal of diverse density, computing the diverse density
evaluation function is computationally expensive.

We present an alternative evaluation function that uses the chi-squared
statistic. This is simpler to define and implement and it allows a
guaranteed pruning method. In particular, we use the contingency table
shown in Table 1. The
rows of the table correspond to the predicted label for the bag and the
columns of the table correspond to the labels on the training bags. Using
the molecule example, if the current hypothesis predicts two of three
positive bags as positive and three of four negative bags as negative,
then the contingency table could be filled out as:

                      Actual positive   Actual negative
Predicted positive           2                 1
Predicted negative           1                 3

The concept with a maximal chi-squared value will have most of its mass
concentrated along the main diagonal and thus will be correctly
predicting the most positive and negative bags. This correlates with the
main idea of diverse density by correctly predicting as many positive
bags as possible while also correctly predicting negative bags. The use
of a contingency table of this form also enables us to introduce a
guaranteed pruning method similar to those of Oates and Cohen (1996) and
Webb (1995). We discuss the pruning method in more detail below.


2. Notation

For the MI notation, we follow that of Maron (1998) and Maron and
Lozano-Perez (1998). The set of positive bags is denoted $B^+$ and the
$i$th positive bag is $B_i^+$. Likewise, the set of negative bags is
denoted $B^-$ and the $i$th negative bag is $B_i^-$. If the discussion
applies to both types of bags, we drop the superscript and refer to it as
$B$. The $j$th instance of the $i$th bag is denoted $B_{ij}$. The target
concept is denoted $c_t$, and other concepts as $c$.

3. Diverse density

We first review diverse density and then examine how our method relates.
We use diverse density as the baseline for comparison in the experimental
results. Maron and Lozano-Perez (1998) and Maron (1998) define the
diverse density of a concept, $c$, to be:

$$DD(c) = \Pr(c \mid B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-), \tag{1}$$

where $\Pr(c)$ is the probability that $c$ is the correct concept, $n$ is
the number of positive bags, and $m$ is the number of negative bags. The
output of a DD search is the concept with a maximal DD value. To perform
this search, we must expand Equation 1. Using Bayes' Rule, Equation 1 can
be rewritten as:

$$DD(c) = \frac{\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid c)\,\Pr(c)}{\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-)}. \tag{2}$$

Assuming a uniform prior probability over the target concepts and
noticing that the denominator is constant with respect to the concept,
finding the concept with the maximum DD value reduces to finding the
maximum likelihood of the positive and negative bags given a specific
concept:

$$\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid c).$$

Maron assumes that the bags are conditionally independent given the
target concept, which allows the likelihood to be rewritten as:

$$\prod_{1 \le i \le n} \Pr(B_i^+ \mid c) \prod_{1 \le i \le m} \Pr(B_i^- \mid c). \tag{3}$$

However, it is still not possible to calculate this exactly without a
model of how the bags were generated. Instead, one can use Bayes' Rule
again to rewrite Expression 3 as:

$$\prod_{1 \le i \le n} \frac{\Pr(c \mid B_i^+)\,\Pr(B_i^+)}{\Pr(c)} \prod_{1 \le i \le m} \frac{\Pr(c \mid B_i^-)\,\Pr(B_i^-)}{\Pr(c)}. \tag{4}$$

Substituting Expression 4 into Equation 2 and omitting the terms that do
not depend on $c$, the task reduces to that of finding the maximum
likelihood over concepts as follows:

$$\prod_{1 \le i \le n} \Pr(c \mid B_i^+) \prod_{1 \le i \le m} \Pr(c \mid B_i^-).$$

It remains to determine the probability of an instance in a bag causing
the concept to be correct, $\Pr(c \mid B_i)$. Maron discusses several
ways to do this. We follow his suggestion of using a noisy-or model
(Pearl, 1988), in which case we have:

$$\Pr(c \mid B_i^+) = 1 - \prod_{1 \le j \le p} \left(1 - \Pr(B_{ij}^+ \in c)\right), \text{ and} \tag{5}$$

$$\Pr(c \mid B_i^-) = \prod_{1 \le j \le p} \left(1 - \Pr(B_{ij}^- \in c)\right), \tag{6}$$

where $p$ is the number of instances in bag $B_i$.

The only part of Equations 5 and 6 that is undefined is the probability
of a particular instance belonging to the target concept:
$\Pr(B_{ij} \in c)$. This can be defined in several ways. For the results
presented here, we use Maron's single point concept class, which he
defines using a Gaussian probability distribution. The concept, $c$, is
assumed to be a $k$-dimensional vector and the probability of an instance
belonging to the concept class is:

$$\Pr(B_{ij} \in c) = \frac{1}{Z} \exp\left(-\sum_{l=1}^{k} \frac{(B_{ijl} - c_l)^2}{\sigma^2}\right), \tag{7}$$

where $B_{ijl}$ is the $l$th feature of the $j$th instance of the $i$th
bag, $c_l$ is the $l$th feature of concept $c$, $Z$ is a scaling factor,
and $\sigma$ is the standard deviation. The standard deviations
$\sigma_+$ and $\sigma_-$ can be chosen separately for positive and
negative bags to allow the two types of evidence to have different
amounts of influence on the DD values.


Note that, in practice, it is numerically more reliable to calculate the
log of the diverse density than to calculate the diverse density
directly, because the product of many probabilities quickly becomes too
small to represent accurately as a floating point number.
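The log-space calculation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the Gaussian-like instance probability of Equation 7 with the scaling factor Z fixed to 1, a single `sigma_sq` shared by positive and negative bags, and a small clamp `eps` to avoid taking the log of zero. The function names are hypothetical.

```python
import math


def instance_prob(instance, concept, sigma_sq=1.0):
    """Gaussian-like Pr(B_ij in c); the scaling factor Z is assumed to be 1."""
    dist_sq = sum((b - c) ** 2 for b, c in zip(instance, concept))
    return math.exp(-dist_sq / sigma_sq)


def log_none_match(bag, concept, sigma_sq, eps):
    """log of prod_j (1 - Pr(B_ij in c)), accumulated as a sum of logs."""
    return sum(
        math.log1p(-min(instance_prob(x, concept, sigma_sq), 1.0 - eps))
        for x in bag
    )


def log_diverse_density(concept, positive_bags, negative_bags,
                        sigma_sq=1.0, eps=1e-12):
    """Log of the noisy-or diverse density (Equations 5 and 6)."""
    log_dd = 0.0
    for bag in positive_bags:
        # Pr(c | B_i^+) = 1 - prod_j (1 - p_ij), computed as -expm1(log prod).
        noisy_or = -math.expm1(log_none_match(bag, concept, sigma_sq, eps))
        log_dd += math.log(max(noisy_or, eps))
    for bag in negative_bags:
        # Pr(c | B_i^-) = prod_j (1 - p_ij).
        log_dd += log_none_match(bag, concept, sigma_sq, eps)
    return log_dd
```

A concept that sits on an instance of every positive bag and far from all negative instances scores close to zero; every mismatch pushes the value further below zero.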


3.1.  Identifying  a  concept  with  maximal  DD 

Concepts with maximal DD values can be found using standard search
techniques. If it is feasible, e.g., if the space is small and can be
viewed as a discrete space, exhaustive search can be used. In larger
spaces, Maron used a gradient-based search approach with multiple
restarts, where each run of the gradient search started from a random
point in a positive bag. This heuristic is useful because the MI
framework already restricts the true concept to appear only in the
positive bags. However, gradient methods have several drawbacks. The
first is that, even with the heuristics on starting locations, gradient
methods can become stuck in local maxima. Second, they require the
function being maximized to be differentiable. While this is true for the
concepts that Maron proposed, such as the single point concept, and for
the linear concept class (McGovern & Barto, 2001b), it is not true for
other types of concepts such as relational graphs (McGovern & Jensen,
2003).

One alternative to the gradient search for diverse density was proposed
by Zhang and Goldman (2002). They combined the expectation-maximization
framework with the search for a concept with maximal DD value. This
approach yields high performance but is again computationally expensive.
Another alternative in large spaces is a simple random search technique
(Rosenstein & Barto, 2001). This is the search method that we use for the
experiments presented in this paper where exhaustive search is not
feasible. In this case, the search starts from a number of random
locations, such as randomly chosen instances from the positive bags, and
the search proceeds in a manner similar to genetic algorithms. The random
search method can still become stuck in local maxima but it can be used
on functions that are not differentiable.

4.  Chi-squared 

We propose an alternative way to identify the best concept in an MI
setting that is based on the main idea of diverse density but relies on
the chi-squared statistic. This is a well-known statistic with a known
sampling distribution. The contingency table used to calculate
chi-squared can also be used to prune the search.

The main idea of diverse density is that the best concept is the concept
that is at the intersection of the positive bags and that is far away
from the union of the negative bags. Using the contingency table shown in
Table 1, chi-squared identifies the concepts that predict the most
positive bags as well as the most negative bags. This is related to the
most diversely dense concept although it does not use the idea of being
far away from the negative bags. The rows of the table correspond to the
predicted label from the concept and the columns correspond to the actual
labels for the training bags. Assuming a method for labeling the bags
given a proposed target concept, the table is filled out in the following
manner. If the concept predicts that the bag will be positive and it is
positive, a is incremented. If the prediction is positive but the bag is
really negative, b is incremented. If the prediction is negative and the
bag is positive, c is incremented. Finally, d is incremented if the
concept predicts negative and the bag is negative.
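The counting rule above can be written down directly. The sketch below assumes hard (boolean) predictions, which is a simplification of the probabilistic labels used later in the paper, and the function name is hypothetical.

```python
def contingency_table(predictions, labels):
    """Return the cells (a, b, c, d) of the 2x2 table.

    predictions and labels are equal-length sequences of booleans,
    one entry per training bag, where True means "positive".
    """
    a = b = c = d = 0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            a += 1  # predicted positive, bag is positive
        elif predicted and not actual:
            b += 1  # predicted positive, bag is negative
        elif not predicted and actual:
            c += 1  # predicted negative, bag is positive
        else:
            d += 1  # predicted negative, bag is negative
    return a, b, c, d
```

For the molecule example from the introduction (two of three positive bags and three of four negative bags predicted correctly), the call returns the cells 2, 1, 1, 3.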

Chi-squared is calculated by summing, over the cells of the contingency
table, the squared difference between the observed and expected values
divided by the expected value. Let $o_a$, $o_b$, $o_c$, and $o_d$ be the
observed values for the cells of the table shown in Table 1 and let
$e_a$, $e_b$, $e_c$, and $e_d$ be the expected values for each of the
cells. The expected value for a cell is calculated by multiplying its row
sum by its column sum and dividing by the total number of elements. The
chi-squared statistic is calculated as:

$$\chi^2 = \sum_{i \in \{a, b, c, d\}} \frac{(o_i - e_i)^2}{e_i}.$$

The best concept is defined as that with the highest chi-squared value.
Chi-squared will be maximal in two cases: when the mass is concentrated
along the main diagonal (i.e., in a and d) and when the mass is
concentrated along the off-diagonal (i.e., in b and c). In the first
case, the proposed concept is correctly predicting a maximum number of
positive and negative bags, which is the overall goal. In the second
case, the concept is predicting exactly the opposite of this goal. This
is a well-known issue with the chi-squared statistic, and the signed
chi-squared statistic addresses it. We define the best concept to have a
maximal signed chi-squared value. Signed chi-squared is positive if the
mass is on the main diagonal and negative if it is on the off-diagonal.
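A small sketch of the signed statistic follows. The text does not spell out how the sign is determined, so this example assumes the common convention of taking the sign of ad - bc, and it assumes that a table with an empty row or column scores zero; both choices and the function name are illustrative.

```python
def signed_chi_squared(a, b, c, d):
    """Signed chi-squared for the 2x2 table [[a, b], [c, d]].

    Assumption: the sign is that of a*d - b*c (positive when the mass
    lies on the main diagonal). Tables with an empty row or column are
    assumed to score 0.0 because the expected values are undefined.
    """
    total = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if total == 0 or 0 in rows or 0 in cols:
        return 0.0
    observed = ((a, b), (c, d))
    chi_sq = 0.0
    for r in range(2):
        for k in range(2):
            expected = rows[r] * cols[k] / total
            chi_sq += (observed[r][k] - expected) ** 2 / expected
    sign = 1.0 if a * d >= b * c else -1.0
    return sign * chi_sq
```

For the example table with cells 2, 1, 1, 3, the expected values are 9/7, 12/7, 12/7, and 16/7, which gives a signed chi-squared of 175/144, or about 1.215. Swapping the rows moves the mass to the off-diagonal and flips the sign.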

There are several advantages to calculating the chi-squared statistic
over calculating diverse density. The first is that it is both simpler to
calculate and less computationally complex for search. Second, this
approach can be used for concept spaces that are not differentiable or
cannot be defined clearly in the diverse density framework. Maron's
approach assumes that the DD equation is differentiable, which means that
the probability $\Pr(B_{ij} \in c)$ must be differentiable. Although this
is generally true in a flat feature space, it is not true for relational
data, yet we can successfully apply the chi-squared technique to such
tasks (McGovern & Jensen, 2003). A third, and very important, advantage
of the chi-squared technique over other MI techniques is that the
chi-squared approach enables a guaranteed pruning mechanism. This means
that the signed chi-squared approach should be more effective in larger
spaces or with a limited amount of time to search because it can search
more thoroughly in the same amount of time.


4.1.  Pruning 


Any search technique that can make use of pruning can be used to identify
the best chi-squared concept. Pruning works in a manner very similar to
that of Oates and Cohen (1996) and Webb (1995). In particular, assume a
concept x is being proposed as a target concept and that the contingency
table for this concept has values:

    a   b
    c   d

This contingency table can be evaluated to give a chi-squared value but
it can also be used to find the maximum possible chi-squared value for a
concept based on x but that is more specific than x. A more specific
concept could be one where $\sigma_+^2$ or $\sigma_-^2$ is smaller or,
for a graphical concept, one where the graph has more nodes or edges.
A more specific concept is unable to match more bags than the original
concept, so the mass in the contingency table is restricted to move from
the top row (i.e., positive predictions) to the bottom row. Also, since
the columns of the table are the labels on the bags in the training set,
the mass cannot move from one column to another without the training data
being relabeled. With these restrictions, a concept based on concept x
would have a maximal chi-squared value [...] chi-squared value will
actually [...]imized and the concept can be immediately pruned as it is
predicting exactly the opposite of the correct bag labels. In the second
case, if the maximum possible value that a concept based on x can reach
is worse than the best concept seen so far in the search, then x can be
pruned and the search moved to a new area.
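The two restrictions stated above (mass may only move from the top row to the bottom row, and only within a column) are enough to compute an optimistic bound by enumeration. The sketch below is an illustration under assumptions, not the authors' bound: it assumes integer bag counts, brute-forces every reachable table instead of using a closed form, and uses the sign of ad - bc for the signed statistic. All function names are hypothetical.

```python
def signed_chi_squared(a, b, c, d):
    """Signed chi-squared; sign convention (a*d - b*c) is an assumption."""
    total = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if total == 0 or 0 in rows or 0 in cols:
        return 0.0
    observed = ((a, b), (c, d))
    chi_sq = 0.0
    for r in range(2):
        for k in range(2):
            expected = rows[r] * cols[k] / total
            chi_sq += (observed[r][k] - expected) ** 2 / expected
    return chi_sq if a * d >= b * c else -chi_sq


def optimistic_bound(a, b, c, d):
    """Best signed chi-squared over tables reachable by specializing.

    A more specific concept keeps a2 <= a and b2 <= b; the removed mass
    moves down within its own column, so column sums never change.
    """
    return max(
        signed_chi_squared(a2, b2, c + (a - a2), d + (b - b2))
        for a2 in range(a + 1)
        for b2 in range(b + 1)
    )


def can_prune(table, best_so_far):
    """True if no specialization of this concept can beat best_so_far."""
    return optimistic_bound(*table) <= best_so_far
```

For the table 2, 1, 1, 3 the bound is reached by moving the single false positive into d, which gives 56/15 (about 3.73); the concept would be pruned only if the best value found so far were at least that large.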

4.2.  Labeling  the  bags 

The cells in the contingency table are filled by predicting the label on
each of the training bags using the proposed target concept. Using the DD
definitions, there are several ways to do this that could make sense. The
first is to calculate the probability that an instance is in the concept,
$\Pr(B_{ij} \in c)$, for each instance in bag $B_i$ and to take the
maximum value, e.g.,

$$\text{label}_{max}(B_i \mid c) = \max_{1 \le j \le p} \Pr(B_{ij} \in c).$$

In this case, a label of one is a prediction of a positive bag and a
label of zero is a prediction of a negative bag. A related approach could
use the sum of these probabilities:

$$\text{label}_{add}(B_i \mid c) = \sum_{1 \le j \le p} \Pr(B_{ij} \in c).$$

Other approaches that might seem useful, based on the product or the sum
of the logarithms of $\Pr(B_{ij} \in c)$ (similar to the calculations
that Maron defines for $\Pr(c \mid B_i)$), do not work well in practice.


Figure  1.  Sample  artificial  data  set  where  there  are  five  positive 
bags  and  five  negative  bags.  Positive  instances  are  labeled  with 
a  number  corresponding  to  the  bag  and  negative  instances  are 
shown  as  dots.  The  rectangle  shows  the  target  concept.  This  data 
set  is  derived  from  (Maron,  1998). 


Using the $\text{label}_{max}$ approach, the cells of the table are
filled out as follows. Given a proposed concept $x$ and a positive bag
$B_i^+$, a is incremented by $\text{label}_{max}(B_i^+, x)$ and c by
$1 - \text{label}_{max}(B_i^+, x)$. For negative bags, b is incremented
by $\text{label}_{max}(B_i^-, x)$ and d by
$1 - \text{label}_{max}(B_i^-, x)$. For a set of bags $B^+$ and $B^-$ and
a proposed concept $x$, the contingency table becomes:

$$\begin{array}{c|c}
\sum_{i=1}^{n} \text{label}_{max}(B_i^+, x) & \sum_{i=1}^{m} \text{label}_{max}(B_i^-, x) \\ \hline
\sum_{i=1}^{n} \left(1 - \text{label}_{max}(B_i^+, x)\right) & \sum_{i=1}^{m} \left(1 - \text{label}_{max}(B_i^-, x)\right)
\end{array}$$

The table for $\text{label}_{add}$ is constructed in a similar manner.
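The construction of this fractional table can be sketched as below. The example assumes the Gaussian-like instance probability with the scaling factor Z fixed to 1 and one shared `sigma_sq` for both bag types; the function names are hypothetical and the code is only an illustration of the formulas above.

```python
import math


def instance_prob(instance, concept, sigma_sq=1.0):
    """Gaussian-like Pr(B_ij in c); the scaling factor Z is assumed to be 1."""
    dist_sq = sum((b - c) ** 2 for b, c in zip(instance, concept))
    return math.exp(-dist_sq / sigma_sq)


def label_max(bag, concept, sigma_sq=1.0):
    """Label a bag with its best-matching instance probability."""
    return max(instance_prob(x, concept, sigma_sq) for x in bag)


def label_add(bag, concept, sigma_sq=1.0):
    """Label a bag with the sum of its instance probabilities."""
    return sum(instance_prob(x, concept, sigma_sq) for x in bag)


def soft_contingency_table(concept, positive_bags, negative_bags,
                           label=label_max, sigma_sq=1.0):
    """Return fractional cells (a, b, c, d) for a proposed concept."""
    pos = [label(bag, concept, sigma_sq) for bag in positive_bags]
    neg = [label(bag, concept, sigma_sq) for bag in negative_bags]
    a = sum(pos)                     # positive bags predicted positive
    b = sum(neg)                     # negative bags predicted positive
    c = sum(1.0 - v for v in pos)    # positive bags predicted negative
    d = sum(1.0 - v for v in neg)    # negative bags predicted negative
    return a, b, c, d
```

Because the labels are probabilities, the cells are real-valued rather than integer counts, but each column still sums to the number of bags of that type when `label_max` is used.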


Under  any  labeling  technique  of  this  form,  when  the  con¬ 
cept  correctly  predicts  the  most  positive  and  negative  bags, 
the  mass  in  the  contingency  table  will  be  concentrated 
along  the  main  diagonal.  This  means  that  the  concept  will 
have  a  maximal  chi-squared  value.  Each  incorrect  predic¬ 
tion  will  decrease  the  signed  chi-squared  value  by  adding 
mass  off  diagonal. 

In  the  next  section,  we  compare  diverse  density  and  the 
signed  chi-squared  approaches  more  directly. 

5.  Experimental  results 

We compare the behavior of the diverse density and signed chi-squared
methods under varying conditions using Maron's "difficult artificial data
set" (Maron, 1998). We use this data set because it allows us to control
many aspects of the data available to the two MI learners while still
presenting a challenging task. The feature space is two-dimensional and
each feature is a real number in the range [0, 100]. The target concept
is a 5x5 square in the middle of the space surrounding the point
(50, 50). Bags are created by uniformly sampling 50 points from the space
and adding these instances to each bag. If any point falls inside the
target concept, then the bag is labeled as positive; otherwise the bag is
labeled as negative.

Figure 2. Accuracy of the diverse density and signed chi-squared methods
with a varying number of positive and negative bags. These numbers are
averaged over 60 different runs.
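The bag-generation procedure can be sketched as follows. The text does not say how a fixed number of positive and negative bags is obtained, so this example assumes simple rejection sampling: bags are generated until enough of each type have been collected. The function names and the seed handling are illustrative.

```python
import random


def in_target(point, center=(50.0, 50.0), side=5.0):
    """True if the point lies inside the square target concept."""
    half = side / 2.0
    return all(abs(p - c) <= half for p, c in zip(point, center))


def make_bag(rng, bag_size=50, low=0.0, high=100.0):
    """Sample one bag of 2-D points uniformly from the feature space."""
    return [(rng.uniform(low, high), rng.uniform(low, high))
            for _ in range(bag_size)]


def make_bags(n_positive, n_negative, seed=0, bag_size=50):
    """Generate bags until the requested number of each label is reached.

    Assumption: surplus bags of either type are simply discarded.
    """
    rng = random.Random(seed)
    positive, negative = [], []
    while len(positive) < n_positive or len(negative) < n_negative:
        bag = make_bag(rng, bag_size)
        if any(in_target(p) for p in bag):
            if len(positive) < n_positive:
                positive.append(bag)
        elif len(negative) < n_negative:
            negative.append(bag)
    return positive, negative
```

With 50 uniformly sampled points and a target that covers 25 of the 10,000 square units, roughly one bag in eight or nine comes out positive, so most of the generated bags are negative.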

Figure 1 shows an example data set with five positive bags and five
negative bags. This task is difficult not only because of the type of
data available but also because of the target concept itself. The concept
that the MI learner is searching for is a point in feature space while
the actual shape of the concept is a square. The single-point concept
class surrounds each point with a Gaussian-like sphere but this means
that points inside a square may be missed, depending on the radius of the
sphere. This makes the task more challenging than if we randomly
generated target points and used a sphere around them to define the
target concept.

For the first experiments, we repeat Maron's measurements of accuracy
using diverse density with a varying number of positive and negative
bags. We also measure accuracy using the chi-squared method under the
same conditions. In this case, we discretized the feature space using
unit-sized grid squares and computed the negative log-likelihood diverse
density values and the chi-squared values for each point in the
discretized space. We then identified the best concepts for both methods.
Accuracy is measured by checking if the best concept fell within the
target region. If this was the case, that run received a score of one,
and zero otherwise. We measured the average accuracy over 60 different
runs, each starting with a different random seed to create the bags. We
varied the number of positive bags from three to twenty and the number of
negative bags from one to twenty. The average accuracy results are shown
in Figure 2. The chi-squared results shown in this figure used the
$\text{label}_{max}$ approach. The two methods have comparable accuracies
for the same number of positive and negative bags, which is expected.
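The scoring procedure for one run can be sketched as below. This is an illustration under assumptions: candidate concepts are taken to be the integer grid points of the [0, 100] space, `score` stands for either evaluation function (for diverse density, a value where larger is better), and the function names are hypothetical.

```python
def best_grid_concept(score, low=0, high=100, step=1):
    """Exhaustively evaluate every grid point and return the best one."""
    candidates = [(x, y)
                  for x in range(low, high + 1, step)
                  for y in range(low, high + 1, step)]
    return max(candidates, key=score)


def run_score(best, center=(50.0, 50.0), side=5.0):
    """1 if the best concept falls inside the target square, else 0."""
    half = side / 2.0
    return int(all(abs(b - c) <= half for b, c in zip(best, center)))


def average_accuracy(run_scores):
    """Mean of the per-run 0/1 scores."""
    return sum(run_scores) / len(run_scores)
```

In use, each run would build its bags from a different random seed, call `best_grid_concept` once per evaluation function, and append the resulting `run_score` to a list that `average_accuracy` then summarizes.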

It is difficult to tell from these graphs if either approach is
outperforming the other by a small but significant amount. To determine
this, we average over the varying number of negative bags and examine the
difference between the methods as a function of the number of positive
bags. Figure 3 shows the average accuracy of chi-squared minus the
accuracy of diverse density using both the $\text{label}_{max}$ and
$\text{label}_{add}$ approaches. There are two things to note from this
picture. First, the $\text{label}_{max}$ approach completely dominates
the $\text{label}_{add}$ approach. Knowing this, we only present results
using the $\text{label}_{max}$ approach for the rest of this paper.
Second, the chi-squared approach using $\text{label}_{max}$ outperforms
diverse density by a few percent, with this difference peaking at around
four percent.

Figure 3. Average difference in accuracy for a varying number of positive
bags (averaged over the varying number of negative bags) for the
chi-squared approach using $\text{label}_{max}$ and $\text{label}_{add}$
to label the bags.

5.1.  Parameter  variations 

One of the main hypotheses of this work is that the chi-squared method
will perform comparably to or better than diverse density under a variety
of conditions. With this in mind, we varied several parameters and
conditions of the experiment and compared the accuracy of the two
algorithms.

The first parameter that we varied was the number of instances in each
bag. The results reported so far used the parameters from Maron (1998) as
we wanted to duplicate these results for diverse density. In this
experiment, we fixed the number of positive bags at 20 and the number of
negative bags at 20 and varied the number of instances in each bag from 5
to 750. The reason that the number of instances in each bag could make a
difference is that as the bag size decreases to one, the problem shifts
from an MI task to a supervised learning task because the inherent noise
of the task decreases (fewer instances per bag means that the true
positive instances are easier to identify).


Figure  4.  Average  accuracy  for  diverse  density  and  chi-squared 
with  a  varying  number  of  instances  in  each  bag.  There  were  20 
positive  bags  and  20  negative  bags. 


A comparison of the accuracy of the two algorithms is shown in Figure 4.
These numbers are averaged over 50 different runs, where each run has
different instances in the bags. For bags with up to 100 instances per
bag, there is no difference between the two methods. However, as the bags
continue to grow in size, the signed chi-squared method significantly
outperforms diverse density. To obtain these results, we used exhaustive
search, which means that the differences are not due to differences in
search ability and depend only on the definition of the best concept for
each method. With a constant number of bags, the problem difficulty
increases as the number of instances in the bag increases. These results
indicate that the chi-squared approach scales better with this aspect of
task difficulty. We hypothesize that this is due to the
$\text{label}_{max}$ approach, which identifies the maximum match from a
bag. This should minimize the effect of a bag's size on the algorithm's
ability to identify the best concept. Diverse density instead multiplies
a number of individual probability calculations together over a bag,
which means that the larger bags may obscure the data more.

As discussed above, one reason that this data set presents a difficult
task is that the size of the Gaussian ball around the target concept was
smaller than the target square. This can be adjusted by varying
$\sigma_+^2$ and $\sigma_-^2$. We expected the two algorithms to perform
comparably with this parameter variation because both relied on the same
underlying calculation of $\Pr(B_{ij} \in c)$. Figure 5 shows the results
of varying $\sigma_+^2 = \sigma_-^2$ from 0.1 to 5.0 in increments of
0.1. As expected, the performance of the two algorithms was comparable.


Figure 5. Average accuracy for diverse density and chi-squared when
$\sigma$ varies from 0.1 to 5.0.


5.2.  The  power  of  pruning 

One of the main advantages of the chi-squared approach over diverse
density is that it offers a guaranteed pruning method. We expected that
this would give a significant advantage to the chi-squared method as the
task was increased in difficulty. To examine this hypothesis, we used
diverse density and chi-squared on several tasks of varying difficulty.
In the first task, we varied the dimensions of the feature vector used to
describe the instance and the concept. As the number of dimensions was
increased, the size of the potential concept space increased
exponentially. This means that pruning should be more useful as the size
of the concept space increases.

One difficulty with varying only the number of features is that this
approach assumes that the same proportion of data is available for each
new dimension. This can be accomplished in two ways: by increasing the
number of bags or by increasing the size of the bags. If we increase the
number of bags, the total number of instances may remain in constant
proportion over the varying dimensions but the task is actually made
easier by having more positive and negative bags available as evidence.
This is illustrated in Figure 6, where we compare the average accuracy of
diverse density and chi-squared when each uses a number of search steps
based on the dimensionality of the task. We used the simple random search
method for both techniques.¹ In this case, the chi-squared approach
significantly outperformed diverse density at the lower dimensions, but
as the number of bags increased with the size of the problem, both
algorithms began to improve their accuracies.

A second way to vary the amount of data available to the MI learner is to
keep the number of bags constant but to increase the number of instances
per bag as a function of the size of the problem space. This should
significantly increase the difficulty of the problem as both the size of
the space increases and the ability to identify the true positive
instances from the positive bags decreases. We compared the accuracy of
the two evaluation functions under this condition and we measured the
total number of steps used for search in the two cases. In this case, the
chi-squared approach is able to achieve higher accuracy than diverse
density in fewer steps. As the difficulty of the problem is increased in
both dimension and number of instances, both methods begin to drop in
accuracy as well as to take more total steps to find the best concept.

¹ Both methods relied on the same underlying code so as to ensure no
difference in performance due to coding.

Figure 6. Average accuracy for diverse density and chi-squared with a
fixed number of steps allowed for the search. The dimensionality of the
feature space varies from 1 to 9 and the number of bags varied as a
function of the number of dimensions.

We also compared the accuracy of the two methods with
two features but with a varying number of instances per
bag. This corresponds to the results presented earlier (cf.
Figure 4), but we used search with pruning instead of
exhaustive search over the entire space. These results are
presented in Figure 7. In this case, there were 20 positive bags
and 20 negative bags with the number of instances in each
bag varying from 5 to 1000. The accuracy of the diverse
density approach drops very quickly as the size of the bags
is increased, while the accuracy of chi-squared drops more
slowly and levels off at a better value.

Figure 8 shows the number of search steps used to identify
the best concept for both diverse density and chi-squared in
this experiment. Chi-squared takes an almost constant number
of steps to find the best concept, while the number of
steps required by diverse density grows in a polynomial
manner. This is significant because it highlights one of the
main advantages of the chi-squared approach: the ability to
prune the search effectively and accurately.

Figure 7. Average accuracy for diverse density and chi-squared
using search in a two dimensional space where the number of
instances in each bag varied from 5 to 1000.

Figure 8. Average and median number of search steps used to
identify the best concept with diverse density and chi-squared in
a two dimensional space where the number of instances in each
bag varied from 5 to 1000.

6.  Discussion  and  Conclusions 

In summary, the chi-squared method that we introduced is
a simpler approach than diverse density and it relies on a
well-known statistic. Using chi-squared allows a guaranteed
pruning mechanism which significantly increases the
efficiency of the search for the best concept. We
demonstrated that the signed chi-squared approach has
comparable or better results than diverse density under a
variety of conditions.
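For readers who want the statistic in concrete form, the sketch below computes a signed chi-squared value from a 2x2 contingency table of bag counts. It uses the standard Pearson formula for a 2x2 table without continuity correction. The sign convention (positive when the concept agrees with the bag labels more often than it disagrees), the argument names, and the handling of degenerate tables are assumptions made for illustration, not details confirmed by the text.

```python
def signed_chi_squared(tp: int, fn: int, fp: int, tn: int) -> float:
    """Pearson chi-squared for a 2x2 table, signed by direction.

    tp: positive bags that the concept labels positive
    fn: positive bags that the concept labels negative
    fp: negative bags that the concept labels positive
    tn: negative bags that the concept labels negative
    """
    n = tp + fn + fp + tn
    row_pos, row_neg = tp + fn, fp + tn   # actual bag labels
    col_pos, col_neg = tp + fp, fn + tn   # labels assigned by the concept
    denom = row_pos * row_neg * col_pos * col_neg
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    chi2 = n * (tp * tn - fn * fp) ** 2 / denom
    # Assumed sign convention: negative when the concept is anti-correlated
    # with the bag labels.
    return chi2 if tp * tn >= fn * fp else -chi2


if __name__ == "__main__":
    print(signed_chi_squared(20, 0, 0, 20))    # perfect separation
    print(signed_chi_squared(0, 20, 20, 0))    # perfectly wrong
    print(signed_chi_squared(10, 10, 10, 10))  # uninformative concept
```

With 20 positive and 20 negative bags, a concept that separates them perfectly scores 40.0 under this convention, one that inverts every label scores -40.0, and an uninformative concept scores 0.0.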

The chi-squared approach is very general and can be used
for a number of different concept classes beyond those
presented here. Any concept class, differentiable or not, can
be used with this approach. This means that other concept
classes can make use of the pruning mechanism described
here.

Although the contingency table that we presented relies on
the bags having binary labels (e.g., positive and negative), it
could be extended to real-valued labels. This is an important
task as the real-valued labels can provide additional
information to an MI learner (Amar et al., 2001). A related
aspect of the diverse density definition is the ability
to weigh the evidence in the positive or negative bags
differently by separately varying the two $\sigma^2$ parameters
(one for the positive bags and one for the negative bags).
This fits into the signed chi-squared framework in the
mechanism for labeling bags.
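To make the bag-labeling step concrete, the following sketch shows one way a candidate concept could label bags and fill in the contingency table. It assumes a point concept with a distance threshold `radius`, where a bag is labeled positive if any of its instances falls within `radius` of the concept. Both the rule and the names `bag_matches` and `contingency_table` are hypothetical illustrations, not the definition used in the experiments.

```python
import math
from typing import List, Sequence, Tuple

Instance = Sequence[float]
Bag = List[Instance]


def bag_matches(bag: Bag, concept: Instance, radius: float) -> bool:
    """Assumed labeling rule: a bag is labeled positive if at least one
    instance lies within `radius` of the concept point."""
    return any(math.dist(x, concept) <= radius for x in bag)


def contingency_table(
    pos_bags: List[Bag],
    neg_bags: List[Bag],
    concept: Instance,
    radius: float,
) -> Tuple[int, int, int, int]:
    """Return (tp, fn, fp, tn) counts of bags for a candidate concept."""
    tp = sum(bag_matches(b, concept, radius) for b in pos_bags)
    fp = sum(bag_matches(b, concept, radius) for b in neg_bags)
    return tp, len(pos_bags) - tp, fp, len(neg_bags) - fp


if __name__ == "__main__":
    pos = [[(0.5, 0.5), (0.9, 0.1)], [(0.1, 0.1)]]
    neg = [[(0.9, 0.9)], [(0.52, 0.5)]]
    print(contingency_table(pos, neg, concept=(0.5, 0.5), radius=0.05))
```

Under this assumed rule, changing how strictly a bag must match (for example, through `radius`) is where evidence from the bags could be weighted differently before the statistic is computed.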